feat: Phase 3 Advanced Features - structure extraction, OCR, caching, tracking by krisoye · Pull Request #9 · krisoye/document-analysis-mcp

krisoye · 2026-02-02T03:36:43Z

Summary

Implements Phase 3 of Epic #21 (Document Analysis MCP Server) with advanced features:

pdf_ocr tool - OCR for scanned PDFs using Tesseract
pdf_extract_structure tool - Extract TOC, tables, and section headings
Document caching - Hash-based deduplication with configurable TTL
Usage tracking - Token logging and cost estimation per operation
Utility tools - cache_stats and usage_summary for monitoring
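
As a rough illustration of the token logging and cost estimation described above, the following minimal sketch aggregates usage per operation. The names `UsageRecord`, `estimate_cost`, `summarize`, and the `PRICES_PER_MTOK` table are hypothetical, and the prices are placeholders rather than real model rates; the actual tracking module may be organized differently.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (placeholders, not real rates).
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


@dataclass
class UsageRecord:
    """One logged API call (hypothetical record shape)."""
    operation: str
    model: str
    input_tokens: int
    output_tokens: int


def estimate_cost(record: UsageRecord) -> float:
    """Estimate the USD cost of one record from the per-model price table."""
    price = PRICES_PER_MTOK[record.model]
    total = (record.input_tokens * price["input"]
             + record.output_tokens * price["output"])
    return total / 1_000_000


def summarize(records: list[UsageRecord]) -> dict[str, dict]:
    """Aggregate call count, token count, and estimated cost per operation."""
    summary: dict[str, dict] = {}
    for r in records:
        entry = summary.setdefault(
            r.operation, {"calls": 0, "tokens": 0, "cost_usd": 0.0}
        )
        entry["calls"] += 1
        entry["tokens"] += r.input_tokens + r.output_tokens
        entry["cost_usd"] += estimate_cost(r)
    return summary
```

A tool like `usage_summary` could plausibly return the dictionary produced by `summarize`, though the real output format is not shown in this PR description.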

Changes

New Tools

| Tool | Description |
| --- | --- |
| `pdf_ocr` | OCR for image-based PDFs with Tesseract |
| `pdf_extract_structure` | Extract document structure (TOC, tables, headings) |
| `cache_stats` | View cache statistics |
| `usage_summary` | View API usage and costs |

New Modules

src/document_analysis_mcp/cache/__init__.py - Hash-based caching
src/document_analysis_mcp/tracking/__init__.py - Usage tracking
src/document_analysis_mcp/tools/ocr.py - OCR tool
src/document_analysis_mcp/tools/structure.py - Structure extraction

Updated Files

server.py - Register new tools, v0.3.0
pyproject.toml - Version bump

Tests

210 tests passing
Full coverage for new modules
Tests for cache, tracking, OCR, and structure extraction

Test Plan

Closes krisoye/project-tracker#96

Generated with Claude Code

… tracking

Implements Phase 3 of Epic #21 (Document Analysis MCP Server):

## New Tools

- **pdf_ocr** - OCR for scanned PDFs using Tesseract
  - Automatic fallback when text extraction fails
  - Configurable language and DPI
  - Force OCR option for guaranteed image-to-text
- **pdf_extract_structure** - Document structure extraction
  - Table of Contents (TOC) detection
  - Table extraction with markdown formatting
  - Section/heading hierarchy detection

## New Modules

- **cache/** - Hash-based document caching
  - SHA-256 content hashing for deduplication
  - Configurable TTL via CACHE_TTL_DAYS env
  - Automatic cleanup of expired entries
  - Parameter-aware cache keys for different tool configs
- **tracking/** - API usage tracking
  - Token usage logging per operation
  - Cost estimation by model
  - Daily summary reports

## Additional Tools

- **cache_stats** - View cache statistics
- **usage_summary** - View API usage and costs

## Version

- Bumped to v0.3.0

Closes krisoye/project-tracker#96

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
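
The cache behaviour listed in this commit (SHA-256 content hashing, parameter-aware keys, TTL from `CACHE_TTL_DAYS`) might look roughly like the sketch below. The function names, the key layout, and the 30-day default are assumptions made for illustration; only the SHA-256 hashing and the `CACHE_TTL_DAYS` variable come from the commit message.

```python
import hashlib
import json
import os
import time
from typing import Optional


def cache_ttl_seconds() -> float:
    """Read the TTL from CACHE_TTL_DAYS; the 30-day default is an assumption."""
    days = float(os.environ.get("CACHE_TTL_DAYS", "30"))
    return days * 86400


def make_cache_key(content: bytes, tool: str, params: dict) -> str:
    """Combine a SHA-256 content hash with a hash of the tool parameters.

    Identical documents share the content hash (deduplication), while
    different tool configurations produce distinct keys.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    # sort_keys makes the key independent of dict insertion order.
    param_blob = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    param_hash = hashlib.sha256(param_blob.encode("utf-8")).hexdigest()[:16]
    return f"{content_hash}:{param_hash}"


def is_expired(created_at: float, now: Optional[float] = None) -> bool:
    """Return True when an entry created at `created_at` has outlived the TTL."""
    now = time.time() if now is None else now
    return now - created_at > cache_ttl_seconds()
```

Hashing the parameters separately keeps the content hash reusable for deduplication statistics, which is one plausible reason to make keys parameter-aware rather than hashing everything together.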

- Add file locking (fcntl) to cache metadata operations for concurrent access
- Add threading.Lock for in-memory cache metadata protection
- Add file locking to usage tracking append and read operations
- Add language parameter validation for OCR tool with VALID_LANGUAGES set
- Add atomic metadata writes using temp file + rename pattern
- Add comprehensive concurrent operation tests for cache and tracking

Thread safety improvements:
- Cache: _save_metadata() uses exclusive lock with atomic write
- Cache: _load_metadata() uses shared lock for concurrent reads
- Cache: All metadata modifications protected by threading.Lock
- Tracking: record() uses exclusive lock for append operations
- Tracking: get_records() uses shared lock for read operations

Input validation:
- OCR: Invalid language codes log warning and fall back to "eng"
- OCR: VALID_LANGUAGES includes 28 common Tesseract language codes

Fixes issues identified in QA review of PR #9

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
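
The locking and atomic-write pattern described in this commit can be sketched as follows. This is a minimal illustration, not the repository's code: `save_metadata` and `load_metadata` are hypothetical stand-ins for the `_save_metadata()` and `_load_metadata()` methods named above, and the separate `.lock` sidecar file is an assumption. `fcntl` is available on Unix-like systems only.

```python
import fcntl
import json
import os
import tempfile
import threading

# Guards in-process access; the flock below guards cross-process access.
_metadata_lock = threading.Lock()


def save_metadata(path: str, metadata: dict) -> None:
    """Write metadata atomically under an exclusive file lock."""
    directory = os.path.dirname(os.path.abspath(path))
    with _metadata_lock, open(path + ".lock", "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Write to a temp file in the same directory, then rename over
            # the target so readers never observe a partially written file.
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w") as tmp:
                    json.dump(metadata, tmp)
                    tmp.flush()
                    os.fsync(tmp.fileno())
                os.replace(tmp_path, path)
            except BaseException:
                if os.path.exists(tmp_path):
                    os.unlink(tmp_path)
                raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def load_metadata(path: str) -> dict:
    """Read metadata under a shared lock so concurrent readers do not block."""
    with open(path + ".lock", "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_SH)
        try:
            if not os.path.exists(path):
                return {}
            with open(path) as f:
                return json.load(f)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The temp file is created in the same directory as the target because `os.replace` is only atomic when source and destination are on the same filesystem.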

- Add file locking (fcntl) to cache metadata operations for concurrent access
- Add threading.Lock for in-memory cache metadata protection
- Add file locking to usage tracking append and read operations
- Add language parameter validation for OCR tool with VALID_LANGUAGES set
- Add atomic metadata writes using temp file + rename pattern
- Add comprehensive concurrent operation tests for cache and tracking

Thread safety improvements:
- Cache: _save_metadata() uses exclusive lock with atomic write
- Cache: _load_metadata() uses shared lock for concurrent reads
- Cache: All metadata modifications protected by threading.Lock
- Tracking: record() uses exclusive lock for append operations
- Tracking: get_records() uses shared lock for read operations

Input validation:
- OCR: Invalid language codes log warning and fall back to "eng"
- OCR: VALID_LANGUAGES includes 28 common Tesseract language codes

Fixes issues identified in QA review of PR #9

Co-authored-by: Krisoye Smith <krisoye@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
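
The OCR input validation described above (warn on an invalid language code, then fall back to "eng") could be sketched like this. The function name `validate_language` is hypothetical, and the set below is a small illustrative subset of Tesseract codes, not the 28-entry `VALID_LANGUAGES` set from the commit.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset only; the real set reportedly holds 28 codes.
VALID_LANGUAGES = frozenset({"eng", "fra", "deu", "spa", "ita"})
DEFAULT_LANGUAGE = "eng"


def validate_language(language: str) -> str:
    """Return a supported language code, falling back to the default."""
    code = language.strip().lower()
    if code in VALID_LANGUAGES:
        return code
    logger.warning(
        "Unsupported OCR language %r; falling back to %r",
        language,
        DEFAULT_LANGUAGE,
    )
    return DEFAULT_LANGUAGE
```

Validating against an allow-list before the value reaches Tesseract avoids passing arbitrary user input to the OCR engine, at the cost of rejecting less common language packs that may be installed.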

krisoye13 approved these changes Feb 2, 2026

krisoye13 merged commit 2de9e22 into main Feb 2, 2026
4 checks passed

krisoye mentioned this pull request Feb 2, 2026

fix: add thread safety and input validation #10

Merged

krisoye deleted the feature/phase3-features branch February 6, 2026 00:00

krisoye13 merged 1 commit into main from feature/phase3-features

2 participants
